ADD BRIEF DESCRIPTION OF WHAT THIS NOTEBOOK IS
(SAMPLE) This notebook contains a set of baseline linear models created to predict Budget_Change_Ratio and Schedule_Change_Ratio using the small set of valid predictors available in our original dataset (therefore, no engineered features were used here).
The code and results given in this notebook are only a summary of the work completed toward this analysis. Supplemental notebooks containing the complete EDA, data cleansing, feature engineering, and model exploration illustrated in this report can be found in the notebooks/ directory of the supporting GitHub project repository. The notebooks in that repository are designed to be run in sequential numbered order to reproduce the results shown here. In addition, throughout this report, we will provide links to specific supporting notebooks for further reference.
As you will see in the imports section of this notebook, a number of custom Python modules are used to generate the results shown throughout. The code for these modules is stored separately in the src/ directory of the supporting GitHub repository for better source code version control, to reproduce output consistently across multiple notebooks, and to keep this report to a manageable length.
Given the set of New York City Capital Projects change data, can we create a model that accurately predicts the 3-year change in forecasted project budget and the 3-year change in forecasted project duration, using only the data available at the start of the project as our predictors?
In other words, using historical project data, can we predict how much the forecasted budget and duration of any given capital project run by the City of New York will deviate from its original estimates by the end of year 3 of the project?
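To make the two response variables concrete, a change ratio here is the fractional deviation from the start-of-project estimate. A hypothetical illustration (the `change_ratio` helper and the numbers are ours, not from the dataset):

```python
def change_ratio(start, end):
    """Fractional change relative to the estimate at project start."""
    return (end - start) / start

# A project budgeted at $10M whose 3-year forecast has grown to $12M:
budget_change_ratio = change_ratio(10_000_000, 12_000_000)   # 0.2
# A 36-month planned duration now forecast at 45 months:
schedule_change_ratio = change_ratio(36, 45)                 # 0.25
```

A ratio of 0 would mean a project tracking its original estimate exactly; negative values would indicate a project running under its original budget or schedule.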
The significance of a model that can accurately address this question is that, given any new project, project managers and city administrators could have another tool at their disposal for objectively identifying potential budget and schedule risk at the start of a new city-run capital project. Such a tool can help to overcome common planning fallacies and ..... (TO ADD ADDITIONAL THEORETICAL TERMS FROM ACADEMIC PAPERS)
ADD AN EXECUTIVE SUMMARY OF OUR FINDINGS HERE (5-6 key points/findings)
SEE THE BASELINE LINEAR MODELS NOTEBOOK FOR AN EXAMPLE
Sigmoid-scaled Budget_Start and Duration_Start predictor data sufficiently minimizes skew to provide the best predictive performance amongst our linear models.
Our two response variables, Budget_Change_Ratio and Schedule_Change_Ratio, exhibit different predictive behavior.
While our Budget_Change_Ratio predictions perform worse than a naive model, as is evidenced by a negative $R^2$ test score, our Schedule_Change_Ratio predictions do moderately well, with our best test $R^2$ score exceeding 0.54.
Both Budget_Change_Ratio and Schedule_Change_Ratio demonstrated enough improvement in our $R^2$ scores to offer hope that a combination of additional engineered features and a sufficiently expressive model might give us some form of acceptable predictive performance.
"BASELINE+" Smoothing Spline GAM results with optimized term penalties by response variable and a label-encoded Category predictor:
['Budget_Start', 'Duration_Start', 'Category_Code']
R-squared scores:
Budget_Change_Ratio:    Training 0.8263    Test -0.8736
Schedule_Change_Ratio:  Training 0.5800    Test 0.5373
from IPython.display import HTML, Image, IFrame, Markdown
HTML(
'''<script> code_show=true; function code_toggle() {
if (code_show){
$('div.input').show();
} else {
$('div.input').hide();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()">
<input type="submit" value="Click here to toggle on/off the raw code.">
</form>'''
)
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# general regressor, scaling, and scoring sklearn imports
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, \
RobustScaler, LabelEncoder
# clustering specific imports
from sklearn.cluster import KMeans
from gap_statistic import OptimalK
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_samples, silhouette_score
# statsmodels and pygam imports
import statsmodels.formula.api as sm
from pygam import LinearGAM, s, f
# import custom .py functions from src/ directory
sys.path.append('..')
from src.datagen import print_interval_dict
from src.scale import scale_features, sigmoid, log_plus_one, encode_categories
from src.model import generate_model_dict, print_model_results
from src.visualize import plot_true_pred, plot_bdgt_sched_scaled, \
plot_change_trend, plot_gam_by_predictor, \
plot_line, plot_value_counts
from src.cluster import silplot, display_gapstat_with_errbars, \
fit_neighbors, plot_epsilon, silscore_dbscan, \
fit_dbscan, print_dbscan_results
# Avoid scientific notation output in Pandas
# pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.options.display.float_format = '{:,.2f}'.format
# Improve resolution of output graphics
%config InlineBackend.figure_format ='retina'
# establish file paths and check to ensure target files exist
filepath_train = '../data/processed/NYC_capital_projects_3yr_final_train.csv'
filepath_test = '../data/processed/NYC_capital_projects_3yr_final_test.csv'
filepath_full = '../data/interim/Capital_Projects_clean.csv'
filepath_allyears = '../data/interim/NYC_capital_projects_all.csv'
error = None
for filepath in [filepath_train, filepath_test, filepath_full, filepath_allyears]:
if not os.path.isfile(filepath):
error = 1
print(
"ERROR - the following target file does not exist:\n\n\t{}\n"\
"".format(filepath)
)
if error is None:
print("OK - all filepaths point to existing files!")
# load dataframes from target files and set datetime columns
data_train = pd.read_csv(filepath_train)
data_test = pd.read_csv(filepath_test)
data_full = pd.read_csv(filepath_full)
data_allyears = pd.read_csv(filepath_allyears)
datetime_cols = [
'Design_Start',
'Final_Change_Date',
'Schedule_Start',
'Schedule_End',
]
for col in datetime_cols:
data_train[col] = pd.to_datetime(data_train[col])
data_test[col] = pd.to_datetime(data_test[col])
data_allyears[col] = pd.to_datetime(data_allyears[col])
def print_record_project_count(dataframe, dataset='full'):
"""Prints summary of records and unique projects in dataframe
"""
if dataset=='full':
print(
'For the ORIGINAL cleansed data, containing all available NYC capital '\
'projects change records:\n'
)
elif dataset=='all':
print(
'For the data containing start and end data for all available '\
'NYC capital projects for the ENTIRE INTERVAL of changes '\
'covered in the ORIGINAL data:\n'
)
else:
print(
'For the final {} data, containing the {} split of 3-year '\
'project data used in this analysis:\n'.format(
dataset.upper(), dataset
)
)
# entries
print(f"\tNumber of dataset records: {len(dataframe)}")
# num projects
print(
f"\tNumber of unique projects in dataset: {dataframe['PID'].nunique()}\n"
)
print_record_project_count(data_full, 'full')
print_record_project_count(data_allyears, 'all')
print_record_project_count(data_train, 'training')
print_record_project_count(data_test, 'test')
print(
'The shapes of our loaded train-test splits are:\n\n'\
'\tTrain\t{}\n\tTest\t{}\n'.format(
data_train.shape, data_test.shape
)
)
print(
'For future reference, here is an overview of the features '\
'contained in our final TRAINING dataset:\n'
)
data_train.info()
print('\n\n...and, here are the first 3 rows of our TRAINING data:')
data_train.head(3)
The unabridged notebooks used to generate the findings in this section can be found here on GitHub and here on GitHub.
IN THIS SECTION:
3.1. Reference class clustering with KMeans and UMAP
3.2. Embedding project description text with BERT
3.3. Encoding the BERT embedding with PCA, Autoencoders, and UMAP
IN THIS SECTION:
The unabridged notebook used to generate the findings in this section can be found here on GitHub.
The unabridged notebook used to generate the findings in this section can be found here on GitHub.
from dataclasses import dataclass
from pickle import dump, load
import hdbscan
import plotly.io as pio
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from IPython.display import Image, SVG
init_notebook_mode()
pio.renderers.keys()
pio.renderers.default = 'jupyterlab'
scaler = load(open("../data/interim/UMAP_fitted_robust_scaler.pkl", "rb"))
clusterer = load(open("../data/interim/UMAP_fitted_HDBSCAN.pkl", "rb"))
mapper_dict = load(open("../data/interim/UMAP_mapper_dict", "rb"))
final_cols = load(open("../data/interim/UMAP_final_cols", "rb"))
embedding_file = "../data/processed/embeddings_uncased_L-2_H-128_A-2.csv"
bert_embedding = pd.read_csv(embedding_file).round(2)
embedding_expanded = bert_embedding["embedding"].str.split(",", expand=True)
# NOTE: every expanded column deliberately shares one name, so that the later
# .drop(columns="description_embedding_original") removes the whole group at once
embedding_expanded.columns = ["description_embedding_original"] * embedding_expanded.shape[1]
bert_embedding = bert_embedding.join(embedding_expanded)
bert_embedding = bert_embedding.drop_duplicates("PID")
bert_embedding = bert_embedding.drop(columns="embedding").reset_index().drop(columns="index")
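As an aside on the expansion step above: `str.split(..., expand=True)` turns the single comma-joined embedding string into one column per embedding dimension. A minimal pandas illustration with made-up values:

```python
import pandas as pd

# toy stand-in for the BERT embedding file: one comma-joined string per project
df = pd.DataFrame({"PID": [1, 2], "embedding": ["0.1,0.2,0.3", "0.4,0.5,0.6"]})

# one string-typed column per dimension; cast to float for numeric use
expanded = df["embedding"].str.split(",", expand=True).astype(float)

print(expanded.shape)  # (2, 3)
```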
class UMAP_embedder():
def __init__(self):
#self.initial_columns = columns
self.initial_columns = ['PID', 'Project_Name', 'Description', 'Category', 'Borough', 'Managing_Agency', 'Client_Agency', 'Phase_Start', 'Current_Project_Years', 'Current_Project_Year', 'Design_Start', 'Budget_Start', 'Schedule_Start', 'Final_Change_Date', 'Final_Change_Years', 'Phase_End', 'Budget_End', 'Schedule_End', 'Number_Changes', 'Duration_Start', 'Duration_End', 'Schedule_Change', 'Budget_Change', 'Schedule_Change_Ratio', 'Budget_Change_Ratio', 'Budget_Abs_Per_Error', 'Budget_Rel_Per_Error', 'Duration_End_Ratio', 'Budget_End_Ratio', 'Duration_Ratio_Inv', 'Budget_Ratio_Inv']
#self.scale_cols = df_to_transform.columns
self.scale_cols = ['Current_Project_Years', 'Current_Project_Year', 'Budget_Start',
'Final_Change_Years', 'Budget_End', 'Number_Changes', 'Duration_Start',
'Duration_End', 'Schedule_Change', 'Budget_Change',
'Schedule_Change_Ratio', 'Budget_Change_Ratio', 'Budget_Abs_Per_Error',
'Budget_Rel_Per_Error', 'Duration_End_Ratio', 'Budget_End_Ratio',
'Duration_Ratio_Inv', 'Budget_Ratio_Inv']
self.scaler = scaler
self.cols_to_dummify = ['Borough', 'Category', 'Client_Agency', 'Managing_Agency',
'Phase_Start', 'Budget_Start', 'Duration_Start']
#self.cols_to_dummify = columns_before_dummified
self.final_cols = final_cols
self.mapper_dict = mapper_dict
self.clusterer = clusterer
self.embedding = bert_embedding
def get_mapping_attributes(self,df, return_extra=False, dimensions="all"):
"""
if return extra = True, returns 3 objects:
0. mapping
1. columns needed to be added to harmonize with entire data
2. dummified df before adding columns of [1]
"""
raw_df = df[self.initial_columns]
df_to_transform = df[self.scale_cols]#.drop(columns=["PID"])
transformed_columns = pd.DataFrame(self.scaler.transform(df_to_transform), columns = df_to_transform.columns)
scaled_df = (df[df.columns.difference(transformed_columns.columns)]).join(transformed_columns)
dummified = pd.get_dummies(scaled_df[self.cols_to_dummify])
# add any dummy columns missing from this data so its columns match the
# full set of final columns seen during fitting
added_cols = set(self.final_cols) - set(dummified.columns)
added_cols = {col: 0 for col in added_cols}
dummified_full = dummified.assign(**added_cols)
dummified_full = dummified_full[self.final_cols]
mapping_df_list =[]
mapper_list = self.mapper_dict["attributes"].values() if dimensions == "all" else [self.mapper_dict["attributes"][dimension] for dimension in dimensions]
for mapper in mapper_list:
mapping = mapper.transform(dummified_full)
mapping_df = pd.DataFrame(mapping, columns= [f"umap_attributes_{mapping.shape[1]}D_embed_{col+1}" for col in range(mapping.shape[1])])
mapping_df_list.append(mapping_df)
final_df = pd.concat(mapping_df_list, axis=1)
final_df["PID"] = scaled_df["PID"]
if return_extra:
return final_df, added_cols, scaled_df, dummified
else:
return final_df
def get_mapping_description(self, df, dimensions= "all"):
merged = df[["PID"]].merge(self.embedding, on = "PID", how="left").drop(columns="PID")
mapping_df_list =[merged]
#mapping_columns = [list(self.embedding.columns.copy())]
mapper_list = self.mapper_dict["description"].values() if dimensions == "all" else [self.mapper_dict["description"][dimension] for dimension in dimensions]
for mapper in mapper_list:
mapping = mapper.transform(merged)
mapping_df = pd.DataFrame(mapping, columns= [f"umap_descr_{mapping.shape[1]}D_embed_{col+1}" for col in range(mapping.shape[1])])
mapping_df_list.append(mapping_df)
# mapping_columns += list(mapping_df.columns.copy())
final_df = pd.concat(mapping_df_list, axis=1)
final_df["PID"] = df["PID"].values
return final_df
def get_full_df(self, df, dimensions="all"):
attribute_df = self.get_mapping_attributes(df, dimensions="all")
description_df = self.get_mapping_description(df)
labels, probabilities = self.get_clustering(attribute_df[["umap_attributes_2D_embed_1", "umap_attributes_2D_embed_2"]])
full_df = description_df.merge(attribute_df, on = "PID", how="left")
full_df["PID"] = attribute_df["PID"].values
full_df["attribute_clustering_label"] = labels
return full_df
def get_clustering(self, attributes_2D_mapping):
assert attributes_2D_mapping.shape[1] == 2
new_labels = hdbscan.approximate_predict(self.clusterer, attributes_2D_mapping)
return new_labels
UMAP_transformer = UMAP_embedder()
plot_df = data_allyears.merge(bert_embedding[["PID"]], on = "PID")
filter_cond = plot_df["Budget_Change_Ratio"].replace([-np.inf, np.inf], np.nan).notnull()
color = plot_df["Borough"][filter_cond]
symbol = plot_df["Category"][filter_cond]
size = np.log(plot_df["Budget_Change_Ratio"][filter_cond].abs() + 1)
UMAP_description_all_projects = UMAP_transformer.get_mapping_description(bert_embedding, dimensions=["2D"]).drop(columns="description_embedding_original")
fig = px.scatter(x=UMAP_description_all_projects[filter_cond]["umap_descr_2D_embed_1"], y=UMAP_description_all_projects[filter_cond]["umap_descr_2D_embed_2"],
size=size,
symbol= symbol,
color = color, opacity=0.8,
color_continuous_scale=px.colors.diverging.Portland_r,
render_mode= "svg",
labels={"color":color.name, "size": size.name, "symbol":symbol.name, "x": "UMAP component 1", "y": "UMAP component 2"},
title= "2D UMAP projections of all projects description's BERT embeddings.\nColor= 'Borough', Size = 'Budget Change Ratio', Symbol= 'Category'",
width=1500, height=1000)
fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
SVG(fig.to_image(format= "svg"))
data_train.head()
plot_df = data_train
filter_cond = data_train["PID"].notnull()
color = plot_df["Schedule_Change_Ratio"][filter_cond]
symbol = plot_df["Category"][filter_cond]
size = np.log(plot_df["Budget_Change_Ratio"][filter_cond].abs() + 1)
fig = px.scatter(x=data_train[filter_cond]["umap_attributes_2D_embed_1"], y=data_train[filter_cond]["umap_attributes_2D_embed_2"],
size=size,
symbol= symbol,
color = color, opacity=0.8,
color_continuous_scale=px.colors.diverging.Portland_r,
render_mode= "svg",
labels={"color":color.name, "size": f"Log {size.name}", "symbol":symbol.name, "x": "UMAP component 1", "y": "UMAP component 2"},
title= "2D UMAP projections of train set projects attributes.\nColor= 'Borough', Size = 'Budget Change Ratio', Symbol= 'Category'",
width=1000, height=800)
fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
fig.update_layout(legend_orientation="h")
SVG(fig.to_image(format= "svg"))
plot_df = data_test
filter_cond = data_test["PID"].notnull()
color = plot_df["Schedule_Change_Ratio"][filter_cond]
symbol = plot_df["Category"][filter_cond]
size = np.log(plot_df["Budget_Change_Ratio"][filter_cond].abs() + 1)
fig = px.scatter(x=plot_df[filter_cond]["umap_attributes_2D_embed_1"], y=plot_df[filter_cond]["umap_attributes_2D_embed_2"],
size=size,
symbol= symbol,
color = color, opacity=0.8,
color_continuous_scale=px.colors.diverging.Portland_r,
render_mode= "svg",
labels={"color":color.name, "size": f"Log {size.name}", "symbol":symbol.name, "x": "UMAP component 1", "y": "UMAP component 2"},
title= "2D UMAP projections of train set projects attributes.\nColor= 'Borough', Size = 'Budget Change Ratio', Symbol= 'Category'",
width=1000, height=800)
fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
fig.update_layout(legend_orientation="h")
SVG(fig.to_image(format= "svg"))
set(data_train.Category).symmetric_difference(set(data_test.Category))
set(data_train.Borough).symmetric_difference(set(data_test.Borough))
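The two checks above guard against levels of Category or Borough appearing in only one split: symmetric_difference returns the empty set exactly when both splits share the same levels, and any level unique to one side surfaces immediately (important because a dummy column present in only one split would break the downstream models). In miniature:

```python
train_levels = {"Bridges", "Parks", "Schools"}
test_levels = {"Bridges", "Parks", "Schools"}

# identical level sets -> empty symmetric difference
assert train_levels.symmetric_difference(test_levels) == set()

# a level unique to either side shows up in the result
test_levels_bad = {"Bridges", "Parks", "Ferries"}
assert train_levels.symmetric_difference(test_levels_bad) == {"Schools", "Ferries"}
```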
plot_df = data_train
filter_cond = data_train["PID"].notnull()
color = plot_df["attribute_clustering_label"][filter_cond]
symbol = plot_df["Category"][filter_cond]
size = np.log(plot_df["Budget_Change_Ratio"][filter_cond].abs() + 1)
fig = px.scatter(x=plot_df[filter_cond]["umap_attributes_2D_embed_1"], y=plot_df[filter_cond]["umap_attributes_2D_embed_2"],
size=size,
symbol= symbol,
color = color, opacity=0.8,
color_continuous_scale=px.colors.diverging.Portland_r,
render_mode= "svg",
labels={"color":color.name, "size": f"Log {size.name}", "symbol":symbol.name, "x": "UMAP component 1", "y": "UMAP component 2"},
title= "2D UMAP projections of train set projects attributes.\nColor= 'HDBSCAN clustering Label', Size = 'Budget Change Ratio', Symbol= 'Category'",
width=1000, height=800)
fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
fig.update_layout(legend_orientation="h")
SVG(fig.to_image(format= "svg"))
from math import pi
def make_spider(mean_peaks_per_cluster, row, title, color):
# number of variable
categories=list(mean_peaks_per_cluster)[1:]
N = len(categories)
# What will be the angle of each axis in the plot? (we divide the plot / number of variable)
angles = [n / float(N) * 2 * pi for n in range(N)]
angles += angles[:1]
# Initialise the spider plot
ax = plt.subplot(4,4,row+1, polar=True, )
# If you want the first axis to be on top:
ax.set_theta_offset(pi / 2)
ax.set_theta_direction(-1)
# Draw one axis per variable and add category labels
plt.xticks(angles[:-1], categories, color='grey', size=8)
# Draw ylabels
ax.set_rlabel_position(0)
#plt.yticks([10,20,30], ["10","20","30"], color="grey", size=7)
#plt.ylim(0,40)
# Ind1
values = mean_peaks_per_cluster.loc[row].drop('group').values.flatten().tolist()
values += values[:1]
ax.plot(angles, values, color=color, linewidth=2, linestyle='solid')
ax.fill(angles, values, color=color, alpha=0.4)
# Add a title
plt.title(title, size=11, color=color, y=1.1)
# ------- PART 2: Apply to all individuals
# initialize the figure
my_dpi=50
plt.figure(figsize=(1000/my_dpi, 1000/my_dpi), dpi=my_dpi + 40)
# Create a color palette:
my_palette = plt.cm.get_cmap("Set2", len(radar_plots_df.index))
# Loop to plot
for row in range(0, len(radar_plots_df.index)):
make_spider( radar_plots_df, row=row, title='group '+radar_plots_df['group'][row].astype("str"), color=my_palette(row))
plt.tight_layout()
px.histogram(test,x="Budget_metric_value",color = "label", facet_row = "Budget_metric", height=2000)
px.histogram(test,x="Schedule_metric_value",color = "label", facet_row = "Schedule_metric", height=2000)
The unabridged notebook used to generate the findings in this section can be found here on GitHub.
IN THIS SECTION:
3.3.1. PCA dimension-reduced encoding of BERT-embedded text
3.3.2. Autoencoder dimension-reduced encoding of BERT-embedded text
3.3.3. UMAP dimension-reduced encoding of BERT-embedded text
The unabridged notebook used to generate the findings in this section can be found here on GitHub.
The unabridged notebook used to generate the findings in this section can be found here on GitHub.
The unabridged notebook used to generate the findings in this section can be found here on GitHub.
plot_df = data_allyears.merge(bert_embedding[["PID"]], on = "PID")
filter_cond = plot_df["Budget_Change_Ratio"].replace([-np.inf, np.inf], np.nan).notnull()
color = plot_df["Borough"][filter_cond]
symbol = plot_df["Category"][filter_cond]
size = np.log(plot_df["Budget_Change_Ratio"][filter_cond].abs() + 1)
UMAP_description_all_projects = UMAP_transformer.get_mapping_description(bert_embedding, dimensions=["2D"]).drop(columns="description_embedding_original")
fig = px.scatter(x=UMAP_description_all_projects[filter_cond]["umap_descr_2D_embed_1"], y=UMAP_description_all_projects[filter_cond]["umap_descr_2D_embed_2"],
size=size,
symbol= symbol,
color = color, opacity=0.8,
color_continuous_scale=px.colors.diverging.Portland_r,
render_mode= "svg",
labels={"color":color.name, "size": size.name, "symbol":symbol.name, "x": "UMAP component 1", "y": "UMAP component 2"},
title= "2D UMAP projections of all projects description's BERT embeddings.\nColor= 'Borough', Size = 'Budget Change Ratio', Symbol= 'Category'",
width=1500, height=1000)
fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
SVG(fig.to_image(format= "svg"))
plot_df = data_allyears.merge(bert_embedding[["PID"]], on = "PID")
filter_cond = plot_df["Budget_Change_Ratio"].replace([-np.inf, np.inf], np.nan).notnull()
color = plot_df["Borough"][filter_cond]
symbol = plot_df["Category"][filter_cond]
size = plot_df["Schedule_Change_Ratio"][filter_cond].abs()
UMAP_description_all_projects = UMAP_transformer.get_mapping_description(bert_embedding, dimensions=["2D"]).drop(columns="description_embedding_original")
fig = px.scatter(x=UMAP_description_all_projects[filter_cond]["umap_descr_2D_embed_1"], y=UMAP_description_all_projects[filter_cond]["umap_descr_2D_embed_2"],
size=size,
symbol= symbol,
color = color, opacity=0.8,
color_continuous_scale=px.colors.diverging.Portland_r,
render_mode= "svg",
labels={"color":color.name, "size": size.name, "symbol":symbol.name, "x": "UMAP component 1", "y": "UMAP component 2"},
title= "2D UMAP projections of all projects description's BERT embeddings.\nColor= 'Borough', Size = 'Schedule Change Ratio', Symbol= 'Category'",
width=1500, height=1000)
fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
SVG(fig.to_image(format= "svg"))
X_cols = [
'Budget_Start',
'Duration_Start',
'Bridges',
'Ferries',
'Industrial_Development',
'Parks',
'Sanitation',
'Schools',
'Sewers',
'Streets_and_Roadways',
'Wastewater_Treatment',
'Water_Supply',
'Category_Code',
'umap_descr_2D_embed_1',
'umap_descr_2D_embed_2',
'umap_attributes_2D_embed_1',
'umap_attributes_2D_embed_2',
'attribute_clustering_label',
'ae_descr_embed_1',
'ae_descr_embed_2',
'pca_descr_embed_1',
'pca_descr_embed_2',
'attribute_km3_label'
]
y_cols = [
'Budget_Change_Ratio',
'Schedule_Change_Ratio'
]
X_train, y_train = data_train[X_cols], data_train[y_cols]
X_test, y_test = data_test[X_cols], data_test[y_cols]
print('{}\t{}'.format(X_train.shape, X_test.shape))
print('{}\t{}'.format(y_train.shape, y_test.shape))
X_train.info()
print()
y_train.info()
X_train.describe()
#######################################
# CREATE SCALED DATAFRAMES
#######################################
# Exclude everything except Budget_Start and Duration_Start from scaling
# (dummy indicators, label codes, and pre-computed embedding features)
exclude_scale_cols = list(X_train)[2:]
# Standardize both X_train and X_test data, fitting X_train as the
# scaler for both
scaler = StandardScaler
scale_before_func = None
scale_after_func = None
reapply_scaler = False
X_train_std, Scaler_std = scale_features(
X_train, X_train,
exclude_scale_cols,
scaler,
scale_before_func,
scale_after_func,
reapply_scaler,
)
X_test_std, _ = scale_features(
X_train, X_test,
exclude_scale_cols,
scaler,
scale_before_func,
scale_after_func,
reapply_scaler,
)
# Standardize X_train and X_test, pass through sigmoid transformation
# and re-standardize to minimize skew of data
scaler = StandardScaler
scale_before_func = None
scale_after_func = sigmoid
reapply_scaler = True
X_train_std_sig, Scaler_std_sig = scale_features(
X_train, X_train,
exclude_scale_cols,
scaler,
scale_before_func,
scale_after_func,
reapply_scaler,
)
X_test_std_sig, _ = scale_features(
X_train, X_test,
exclude_scale_cols,
scaler,
scale_before_func,
scale_after_func,
reapply_scaler,
)
# inspect scaled datasets
plot_bdgt_sched_scaled(X_train, X_train_std, 'Standardized', X_test, X_test_std)
plot_bdgt_sched_scaled(X_train, X_train_std_sig, 'Sigmoid standardized', X_test, X_test_std_sig)
FINDINGS:
By visualizing our Budget_Start and Duration_Start predictors above, we can see a large skew with clear outliers in the original unscaled data. By applying standardization to these predictors, as illustrated in the upper right-hand plot, we have set both variables to the same scale. However, standardizing has not alleviated the skewness of our data or helped with our outlying datapoints.
Therefore, we have also applied a sigmoid transformation to the data and re-standardized, as shown in the lower right-hand plot. This sigmoid transformation has helped to alleviate the skew of our data, and it has also helped to more evenly distribute all of our data points, drawing outliers far closer to the center of the distribution.
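The scale_features helper in src/scale.py encapsulates this standardize, sigmoid, re-standardize pipeline; conceptually it amounts to the following numpy sketch (a simplified stand-in, not the actual implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# heavily right-skewed toy "budget" data with one large outlier
x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])

# 1) standardize, 2) squash through the bounded sigmoid to compress the
# tails, 3) re-standardize back to zero mean and unit variance
z = (x - x.mean()) / x.std()
s = sigmoid(z)
z2 = (s - s.mean()) / s.std()
```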
Now we will fit a "Baseline" linear regression model on our scaled datasets to see which performs best.
The unabridged notebook used to generate the findings in this section can be found here on GitHub.
Budget_Start and Duration_Start predictors
features = [
'Budget_Start',
'Duration_Start'
]
print(
'\nThese 2 "BASELINE" models used the following predictors:\n\n\t{}\n\n'\
''.format(features)
)
sm_formulas = [
' + '.join(features),
' + '.join(features)
]
model_descr = 'Baseline linear regression, standardized data, 2 predictors'
model_LR2 = generate_model_dict(
sm.ols,
model_descr,
X_train_std[features], X_test_std[features], y_train, y_test,
multioutput=True,
verbose=False,
predictions=True,
scores=True,
model_api='statsmodels',
sm_formulas=sm_formulas,
)
print_model_results(model_LR2)
model_descr = 'Baseline linear regression, sigmoid standardized data, 2 predictors'
model_LR2_sig = generate_model_dict(
sm.ols,
model_descr,
X_train_std_sig[features], X_test_std_sig[features], y_train, y_test,
multioutput=True,
verbose=False,
predictions=True,
scores=True,
model_api='statsmodels',
sm_formulas=sm_formulas,
)
print_model_results(model_LR2_sig)
FINDINGS:
By using sigmoid scaled data for both our Budget_Start and Duration_Start predictors, we can see that reducing skew in our predictors generates slightly better performance in predicting both Budget_Change_Ratio and Schedule_Change_Ratio in both our train and test sets.
While these are just Baseline models in which we use Linear Regression with only 2 predictors, the results indicate that:
Budget_Change_Ratio may prove more difficult to predict than Schedule_Change_Ratio: our predictions for Budget_Change_Ratio perform worse than a naive model, as is indicated by the negative $R^2$ score for the test data.
Category as a predictor
features = list(X_train)[:-10]
print(
'\nThis "BASELINE+" model uses the project Category '\
'as one-hot-encoded predictors:'\
'\n\n\t{}\n\n'.format(features)
)
sm_formulas = [
' + '.join(features),
' + '.join(features)
]
model_descr = 'Baseline linear regression, sigmoid standardized data, with categories'
model_LR3 = generate_model_dict(
sm.ols,
model_descr,
X_train_std_sig[features], X_test_std_sig[features], y_train, y_test,
multioutput=True,
verbose=False,
predictions=True,
scores=True,
model_api='statsmodels',
sm_formulas=sm_formulas,
)
print_model_results(model_LR3)
FINDINGS:
By adding the one-hot-encoded Category feature to our model, which provides information about the type of project for each observation, we can see a clear improvement in our Schedule_Change_Ratio $R^2$ scores for both train and test predictions.
Our Budget_Change_Ratio results, on the other hand, have degraded in performance when compared to our simpler Baseline model with just 2 predictors. Regardless, neither the Baseline nor Baseline+ linear regression models are able to predict 3-year Budget_Change_Ratio results better than the naive model, as is indicated by these $R^2$ results.
This indicates to us that we will likely have more difficulty in predicting Budget_Change_Ratio in our future models and that a Linear Regression model likely lacks the expressiveness required to adequately fit a model to the underlying relationship between predictors and outcome variable.
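The "naive model" comparison here is exactly what $R^2$ encodes: always predicting the mean of the observed response scores $R^2 = 0$, so a negative $R^2$ means the model's squared errors exceed those of that constant prediction. A small illustration with made-up numbers:

```python
def r2_score_simple(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [0.1, 0.2, 0.4, 0.3]

# the naive model predicts the mean (0.25) everywhere -> R^2 of (almost
# exactly) 0, up to floating-point noise
naive = [0.25] * 4

# a model whose errors are larger than the naive model's -> negative R^2
worse = [0.6, -0.2, 0.9, -0.1]
```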
As one last step before moving on from Linear Regression, we will quickly inspect the predictions made by our Baseline+ model, as well as the regression coefficients.
plot_true_pred(model_dict=model_LR3, dataset='train')
plot_true_pred(model_dict=model_LR3, dataset='test')
ADD PLOT OF COEFFICIENTS & INTERPRETATION
IN THIS SECTION:
The unabridged notebook used to generate the findings in this section can be found here on GitHub.
features = [
'Budget_Start',
'Duration_Start',
'Category_Code',
]
print(
'\nThis smoothing spline GAM uses the same predictors as our '\
'"BASELINE+" regression model,\nexcept Category is label-encoded '\
'instead of one-hot-encoded:'\
'\n\n\t{}\n\n'.format(features)
)
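The distinction matters mechanically: the statsmodels formulas above consumed one indicator column per category level, while pygam's f() factor term expects a single integer-coded column. A short pandas sketch of the two encodings (using a few of the actual Category levels):

```python
import pandas as pd

df = pd.DataFrame({"Category": ["Bridges", "Parks", "Schools", "Parks"]})

# one-hot: one indicator column per level (as in the "BASELINE+" regression)
one_hot = pd.get_dummies(df["Category"])

# label encoding: a single integer code per level (as used by the f() term)
df["Category_Code"] = df["Category"].astype("category").cat.codes

print(one_hot.columns.tolist())       # ['Bridges', 'Parks', 'Schools']
print(df["Category_Code"].tolist())   # [0, 1, 2, 1]
```

Note that the integer codes impose an arbitrary ordering, which is why they are paired with a factor term rather than treated as a continuous predictor.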
model_descr = 'Smoothing spline GAM, sigmoid standardized data, with categories'
terms = s(0) + s(1) + f(2)
model_GAM0 = generate_model_dict(
LinearGAM,
model_descr,
X_train_std_sig[features], X_test_std_sig[features], y_train, y_test,
multioutput=False,
verbose=False,
predictions=True,
scores=True,
model_api='sklearn',
terms=terms,
)
print_model_results(model_GAM0)
FINDINGS:
Here we can see that we stand to benefit from the added expressiveness of the smoothing spline class of linear model, as is illustrated by the improved $R^2$ results shown above. While our Schedule_Change_Ratio predictions have improved from $R^2=0.34$ to $R^2=0.58$, our Budget_Change_Ratio test $R^2$ score is still extremely negative. However, the smoothing spline GAM does appear to have fitted the training set with a Budget_Change_Ratio $R^2$ score of 0.54, which is somewhat promising, if only we can find a model that generalizes well to unseen data.
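The expressiveness gap can be seen in miniature: on a curved relationship, a straight-line fit captures almost none of the signal, while a more flexible basis (a crude stand-in for a smoothing spline) captures most of it. A toy numpy example, not project data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = x ** 2 + rng.normal(0, 0.02, size=x.size)   # curved relationship

def r2(y_true, y_hat):
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# degree-1 fit: the analogue of plain linear regression
linear_hat = np.polyval(np.polyfit(x, y, 1), x)
# degree-3 fit: a crude stand-in for a more expressive smoother
cubic_hat = np.polyval(np.polyfit(x, y, 3), x)
```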
We use pygam's gridsearch method to choose our values of $\lambda$.
terms = s(0) + s(1) + f(2)
# generate list of lambdas against which to perform gridsearch for
# each outcome variable and each of our 3 input predictors
lam_list = np.logspace(-3, 5, 10)
lams = [lam_list] * 3
# fit GAM to predict budget change ratio
gam1 = LinearGAM(terms).fit(
X_train_std_sig[features], y_train['Budget_Change_Ratio']
)
# perform gridsearch to find optimal lambdas for each term
gam1.gridsearch(
X_train_std_sig[features], y_train['Budget_Change_Ratio'],
lam=lams
)
# fit GAM to predict schedule change ratio
gam2 = LinearGAM(terms).fit(
X_train_std_sig[features], y_train['Schedule_Change_Ratio']
)
# perform gridsearch to find optimal lambdas for each term
gam2.gridsearch(
X_train_std_sig[features], y_train['Schedule_Change_Ratio'],
lam=lams
)
%%capture --no-stdout
print(
'\nGAM gridsearch results for BUDGET_CHANGE_RATIO prediction model:\n'
)
gam1.summary()
print(
'\n\n\nGAM gridsearch results for SCHEDULE_CHANGE_RATIO prediction model:\n'
)
gam2.summary()
ADD INTERPRETATION
print(
'\nThese smoothing spline GAMs have been fit using the optimal lambda penalties '\
'for each term, one set of results show models optimized for BUDGET_CHANGE_RATIO '\
'predictions, the other for SCHEDULE_CHANGE_RATIO predictions.\n\n'\
'The predictors used are:\n\n\t{}\n\n'.format(features)
)
model_descr = 'Smoothing spline GAM: sigmoid scaled, BUDGET_CHANGE_RATIO optimal penalties'
terms = s(0, lam=0.001) + s(1, lam=100000) + f(2, lam=215.4435)
model_GAM1 = generate_model_dict(
LinearGAM,
model_descr,
X_train_std_sig[features], X_test_std_sig[features], y_train, y_test,
multioutput=False,
verbose=False,
predictions=True,
scores=True,
model_api='sklearn',
terms=terms,
)
print_model_results(model_GAM1)
model_descr = 'Smoothing spline GAM: sigmoid scaled, SCHEDULE_CHANGE_RATIO optimal penalties'
terms = s(0, lam=100000) + s(1, lam=27.8256) + f(2, lam=3.5938)
model_GAM2 = generate_model_dict(
LinearGAM,
model_descr,
X_train_std_sig[features], X_test_std_sig[features], y_train, y_test,
multioutput=False,
verbose=False,
predictions=True,
scores=True,
model_api='sklearn',
terms=terms,
)
# generate_model_dict(
# LinearGAM, model_descr, X_le_tr_std_sig, X_le_te_std_sig, y_train, y_test, terms=terms, multioutput=False
# )
print_model_results(model_GAM2)
ADD INTERPRETATION
y_pred_train = np.hstack(
[
model_GAM1['predictions']['train'][:, 0].reshape(-1,1),
model_GAM2['predictions']['train'][:, 1].reshape(-1,1)
]
)
y_pred_test = np.hstack(
[
model_GAM1['predictions']['test'][:, 0].reshape(-1,1),
model_GAM2['predictions']['test'][:, 1].reshape(-1,1)
]
)
Conclusions
model_descr = "Smoothing spline GAM: standardized sigmoid scale data and optimized $\lambda$'s"
y1_label = 'Budget Change Ratio'
y2_label = 'Schedule Change Ratio'
plot_true_pred(
dataset='train',
y_true=y_train,
y_pred=y_pred_train,
y1_label=y1_label,
y2_label=y2_label,
model_descr=model_descr
)
plot_true_pred(
dataset='test',
y_true=y_test,
y_pred=y_pred_test,
y1_label=y1_label,
y2_label=y2_label,
model_descr=model_descr
)
ADD INTERPRETATION
plot_gam_by_predictor(
model_dict=model_GAM1, model_index=0,
X_data=X_train_std_sig[features], y_data=y_train,
dataset='train'
)
print()
plot_gam_by_predictor(
model_dict=model_GAM2, model_index=1,
X_data=X_train_std_sig[features], y_data=y_train,
dataset='train'
)
ADD INTERPRETATION & CONCLUSIONS
The unabridged notebook used to generate the findings in this section can be found here on GitHub.
IN THIS SECTION:
The unabridged notebook used to generate the findings in this section can be found here on GitHub.
The unabridged notebook used to generate the findings in this section can be found here on GitHub.
import shap
import glob
shap.initjs()
from pickle import load
model_dict = {}
# NOTE: assumes the fitted models were pickled to .pkl files in this directory
for file in glob.glob("../data/interim/pickled_models/*.pkl"):
    with open(file, "rb") as f:
        model_dict[file] = load(f)
The .csv file used to generate this data dictionary can be found here on GitHub.